Hierarchical network analysis of co-occurring bioentities in literature

Biomedical databases grow by more than a thousand new publications every day. The volume of biomedical literature, published at an unprecedented rate, makes it difficult to discover knowledge relevant to keywords of interest, gather new insights, and form hypotheses. A text-mining tool, PubTator, helps to automatically annotate bioentities, such as species, chemicals, genes, and diseases, in PubMed abstracts and full-text articles. However, the manual re-organization and analysis of bioentities is a non-trivial and highly time-consuming task. ChexMix was designed to extract the unique identifiers of bioentities from query results. Herein, ChexMix was used to construct a taxonomic tree with allied species among Korean native plants and to extract the Medical Subject Headings (MeSH) unique identifiers of the bioentities that co-occurred with the keywords in the same literature. ChexMix discovered the allied species related to a keyword of interest, and its usefulness for multi-species analysis was confirmed experimentally.

www.nature.com/scientificreports/ comparable binding pockets, the types of backbones or derivatives help inspect their physicochemical properties and/or biological roles. Moreover, annotation methods have been introduced to classify the structural types of chemicals in chemical profiling studies [17][18][19]. Even in the case of protein targets, gene expression helps to understand the physiological function of proteins and to identify related physiological disorders and pathologies 20,21.
ChexMix was developed to extract bioentities based on keywords in the literature as well as keywords entered by researchers (Fig. 1). It was designed to help organize biomedical data into hierarchical knowledge based on topological similarities between bioentities. The analysis of co-occurring biomedical terms rests on the assumption that bioentities appearing in the same abstract or full text are biologically or chemically related to each other 2. These associations can then be visualized as network graphs or hierarchical trees and more easily analyzed to uncover hidden insights from existing knowledge.

Results and discussion
ChexMix was designed to extract hierarchical and topological information related to bioentities. It extracts the bioentities that co-occur with the keywords queried in PubMed and encodes them as unique identifiers that index their related information. Combining a hierarchical representation with a mapping of bioentities to identifiers at each level allows the relationships between them to be organized and cross-referenced. For example, species retrieved for keywords of interest, such as chemicals or diseases, can be hierarchically represented from the highest rank, 'cellular organisms', according to the phylogenetic taxonomic system of the NCBI taxonomy database 16. The search results are arranged according to the hierarchical characteristics of each bioentity and can be displayed as plots for hierarchical data visualization or as nested lists (Fig. 1); the information is therefore useful for inspecting relationships among keywords of interest. Herein, ChexMix was applied to discover natural product sources of the bioactive compound amentoflavone, which has a wide range of biological activities, including antioxidative, anti-inflammatory, anticancer, antiviral, and antifungal properties 22. This compound also shows potent antisenescence activity against ultraviolet B irradiation-induced skin aging, preventing nuclear aberrations 23; thus, it can be used for the prevention of skin aging in the cosmetic industry.
Firstly, 319 bioentities were extracted by ChexMix using the keyword 'amentoflavone' under the highest taxonomic rank, 'cellular organisms' (Fig. 2). Among them, 223 species in the Viridiplantae (literally 'green plants') clade were targeted. It was possible to verify that these species co-occurred with amentoflavone in the same study and to investigate whether a plant species could produce amentoflavone (Supplementary Table S1).
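Restricting the extracted species to one clade can be illustrated with a minimal sketch. It assumes each species is available as a root-to-leaf lineage; the function names and the abridged toy lineages are hypothetical and are not taken from ChexMix or from the study data.

```python
from collections import defaultdict


def build_tree(lineages):
    """Build a parent -> children mapping from root-to-leaf lineages."""
    children = defaultdict(set)
    for lineage in lineages:
        # Link every consecutive pair of ranks in the lineage.
        for parent, child in zip(lineage, lineage[1:]):
            children[parent].add(child)
    return children


def leaves_under(children, clade):
    """Return the leaf names (for example, species) found below a clade."""
    stack, leaves = [clade], set()
    while stack:
        node = stack.pop()
        kids = children.get(node)
        if kids:
            stack.extend(kids)
        elif node != clade:
            leaves.add(node)
    return leaves


# Toy, abridged lineages for illustration only (assumption, not study data).
lineages = [
    ["cellular organisms", "Eukaryota", "Viridiplantae", "Selaginella tamariscina"],
    ["cellular organisms", "Eukaryota", "Viridiplantae", "Ginkgo biloba"],
    ["cellular organisms", "Eukaryota", "Metazoa", "Homo sapiens"],
]
tree = build_tree(lineages)
plants = leaves_under(tree, "Viridiplantae")
print(sorted(plants))
```

The same traversal works at any rank, so the scope can be narrowed from 'cellular organisms' to a family or genus by changing the clade argument.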
To avoid duplicating earlier studies and to find novel bioactive sources, the analysis was focused on the allied species belonging to the Viburnum genus, retrieving 19 samples of different parts of eight species native to Korea that had not previously been studied on amentoflavone-related topics (Fig. 3, Supplementary Table S2). Next, the presence of amentoflavone was evaluated in samples of these plants and quantified by HPLC. The presence of amentoflavone was confirmed by its isotopic peak at m/z 537.4 [M − H]− detected by liquid chromatography-mass spectrometry. Among the samples, the leaves of V. erosum contained the highest amount of amentoflavone (7.39 mg/g) compared with Selaginella tamariscina, which is the representative natural ingredient for anti-wrinkle effects and the major source of amentoflavone in the cosmetic industry 24. Overall, summarizing hierarchical bioentity information using ChexMix is expected to help inspect massive, sparse bioentity data in databases in future investigations.
The performance of ChexMix was quantitatively evaluated based on the bioentities extracted using a set of keywords associated with the original keyword 'amentoflavone'. A total of 243 taxonomy networks were obtained using ChexMix from the MeSH terms of chemicals that co-occurred with 'amentoflavone' in the literature, and they were analyzed in terms of basic network properties and similarity metrics (Supplementary Table S3). The similarity metrics compared each of the 243 networks with the network of 'amentoflavone', where the number of true positives was calculated as the number of nodes common to both networks (Supplementary Table S3).
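A node-based comparison of this kind can be sketched as follows. Only the definition of true positives (nodes common to both networks) comes from the text; the definitions of false positives and false negatives, and the choice of precision, recall, and the Jaccard index, are assumptions made for illustration. The metrics actually reported are those in Supplementary Table S3.

```python
def compare_networks(reference_nodes, candidate_nodes):
    """Compare a candidate network with a reference network by their node sets."""
    ref, cand = set(reference_nodes), set(candidate_nodes)
    tp = len(ref & cand)   # nodes common to both networks (as in the text)
    fp = len(cand - ref)   # assumption: nodes only in the candidate network
    fn = len(ref - cand)   # assumption: nodes only in the reference network
    union = len(ref | cand)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": tp / (tp + fp) if cand else 0.0,
        "recall": tp / (tp + fn) if ref else 0.0,
        "jaccard": tp / union if union else 0.0,
    }


# Hypothetical node labels standing in for taxonomy identifiers.
reference = {"A", "B", "C", "D"}
candidate = {"B", "C", "E"}
print(compare_networks(reference, candidate))
```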
Additionally, ChexMix can integrate the results from multiple keywords. The MeSH identifiers of bioentities co-occurring with the keywords of interest can be used to connect the results of two different queries (Fig. 4). For instance, two species names, Taxus cuspidata and Podophyllum peltatum, were queried in ChexMix and generated two small networks consisting of bioentities with MeSH identifiers extracted from PubTator. It was possible to inspect the co-occurring bioentities among the MeSH identifiers in the integrated network. The network of each species showed a different MeSH identifier profile, but MeSH identifiers related to 'cancer', in particular 'ovarian neoplasms', co-occurred with both. This agrees with the fact that paclitaxel from T. cuspidata and podophyllotoxin from P. peltatum are well-known potent anticancer drugs for ovarian cancer [25][26][27].
Here, a usage scenario was described in which ChexMix alleviates the complex task of compiling large data by narrowing down the scope of bioentities or grouping similar bioentities using their hierarchical relationships. Firstly, to obtain the appearance counts of bioentities in the literature queried by keywords of interest, ChexMix collects PubMed and PubMed Central (PMC) literature, fetches the annotations within those data from PubTator Central (PTC), and converts them into unique identifiers according to the respective bioentity class. For PubMed literature searches, ChexMix allows Boolean operators ('AND', 'OR', 'NOT'), double quotes for phrases, and asterisks for truncated terms. Each bioentity extracted by ChexMix is classified within more general categories of bioentity and arranged in a hierarchical structure.
When single or multiple keywords of interest are entered in ChexMix, the bioentities in all citations that contain the keywords are retrieved and automatically mapped to unique identifiers. The search results indicate the co-occurrence of bioentities in the available literature, which allows them to be linked and yields the co-occurrence network. ChexMix makes the process straightforward by managing data access from multiple sources and providing functions to manipulate the network data structure. The analysis is mainly focused on taxonomy terms to inspect the species that biologically affect physiological disorders or diseases within the network. Each taxonomy name in the search results is listed in a hierarchical form. Trivial bioentities are located on the higher ranks of the list. Other closely related species within the obtained taxonomic tree are expected to have similar biological effects, representing potential alternative biomedical options. ChexMix can also generate connections between taxonomic terms and the MeSH identifiers located under 'Diseases [C]' and 'Chemicals and Drugs [D]' in the same literature. MeSH identifiers co-occurring with a taxonomic term in the literature are expected to have a close relationship with it.
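The step from per-citation annotations to a co-occurrence network can be sketched as below. This is a minimal illustration, assuming the annotations are already available as a mapping from a citation identifier to the bioentity identifiers found in that citation; the function name and the placeholder identifiers are hypothetical and do not reflect the internal data structures of ChexMix.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(annotations):
    """Count how often each pair of bioentities appears in the same citation."""
    edges = Counter()
    for entity_ids in annotations.values():
        # De-duplicate within a citation, then link every unordered pair once.
        for a, b in combinations(sorted(set(entity_ids)), 2):
            edges[(a, b)] += 1
    return edges


# Placeholder identifiers for illustration only (assumption, not real data).
annotations = {
    "citation_1": ["taxid:A", "mesh:X", "mesh:Y"],
    "citation_2": ["taxid:A", "mesh:X", "mesh:X"],
}
for (a, b), weight in sorted(cooccurrence_edges(annotations).items()):
    print(a, b, weight)
```

The edge weight here is the number of citations in which both bioentities appear, which is one simple way to express the strength of a co-occurrence.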
In Fig. 4, the intersection of the MeSH terms co-occurring with each taxonomy keyword is highlighted on the whole network formed by the union of the two networks. Networks generated from a single keyword in ChexMix can be reprocessed simply by combining them with other networks through set operations, such as union, difference, and intersection. The re-organization of complex networks from single or multiple keywords provides new insights or clues about bioentities in PubMed, the largest biomedical database.
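The set operations described here can be sketched with plain Python sets. The example assumes each network is a set of (keyword, MeSH term) edges; the edge labels are readable stand-ins chosen from the terms mentioned in the text, not actual MeSH identifiers or results, and the helper name is hypothetical.

```python
def neighbors(edges, keyword):
    """Return the terms directly linked to a keyword in a set of edges."""
    return {b if a == keyword else a for a, b in edges if keyword in (a, b)}


# Illustrative edges only (assumption): keyword -> co-occurring MeSH term.
net_taxus = {
    ("Taxus cuspidata", "Paclitaxel"),
    ("Taxus cuspidata", "Ovarian Neoplasms"),
}
net_podophyllum = {
    ("Podophyllum peltatum", "Podophyllotoxin"),
    ("Podophyllum peltatum", "Ovarian Neoplasms"),
}

# Union: the whole network built from both queries.
whole = net_taxus | net_podophyllum

# Intersection of neighbors: terms shared by both keywords, to be highlighted.
shared = neighbors(net_taxus, "Taxus cuspidata") & neighbors(
    net_podophyllum, "Podophyllum peltatum"
)

# Difference: terms linked only to the first keyword.
only_taxus = neighbors(net_taxus, "Taxus cuspidata") - shared

print(len(whole), shared, only_taxus)
```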
In the present study, we have focused on how to use ChexMix to construct a taxonomic tree or a co-occurrence network from multiple keywords, and to analyze the networks of bioentities identified by PTC. We designed ChexMix to adapt easily to diverse types of bioentities and to integrate other existing databases as well as recently introduced state-of-the-art text-mining systems 28. We hope ChexMix will be used by other researchers to integrate other datasets, and to manipulate and visualize the relationships between bioentities.

Methods
Data processing. ChexMix currently obtains biomedical data from multiple databases using their web application programming interfaces (APIs) or bulk data files. For example, the Entrez API allows Entrez databases, such as PubMed and PubMed Central (PMC), to be queried using combinations of keywords. PTC provides a web API to fetch the annotations of biomedical concepts, such as taxonomy and MeSH identifiers, in a publication. ChexMix also downloads and parses bulk data files from biomedical databases 29. For example, ChexMix loads the data from PTC, including NCBI taxonomy and MeSH, which inherently contain relationships between their entities, and transforms them into internal network data structures. ChexMix also makes it possible to construct, manipulate, and simplify the network data structures.
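As an illustration of a keyword query against Entrez, the following sketch builds a request URL for the public NCBI E-utilities ESearch endpoint using a Boolean operator and a quoted phrase. It only constructs the URL and sends no request. The function name and the default values are assumptions for this example; it is not a description of how ChexMix itself calls the API.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=100):
    """Build an ESearch URL for a keyword query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces, quotes, and asterisks for use in a URL.
    return f"{ESEARCH}?{urlencode(params)}"


# Boolean operator plus a quoted phrase, as supported by PubMed search.
print(build_esearch_url('amentoflavone AND "skin aging"'))
```

Passing `db="pmc"` targets PubMed Central instead of PubMed. The returned identifiers could then be used to request annotations for each publication.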
Bioentities extraction and visualization. The keywords of interest can be input as single words or phrases. The results are output in a hierarchical tree format according to the taxonomic or hierarchically organized rules of each type of bioentity. In the case of taxonomy information, species names in the literature are encoded into unique identifiers (TaxID) and hierarchically re-organized according to the classification rules of NCBI taxonomy. In the present study, the hierarchical results were applied to discover relevant species at lower taxonomic ranks (family and genus levels) using the list of Korean medicinal plants of the Korea Plant Extract Bank (KPEB). The results were visualized in network format using the Gephi software (ver. 0.9.2).
High-performance liquid chromatography (HPLC) analysis. The samples were analyzed on a 1260 system comprising a quaternary pump, an autosampler, and a multiple wavelength detector (Agilent Technologies, Santa Clara, CA, USA). Chromatographic separation was performed using a Hector-M C18 column (250 × 4.6 mm I.D.; 5 μm, RSTech, Daejeon, Korea). The ultraviolet detector was set at a wavelength of 260 nm. The mobile phase was a gradient solvent system consisting of solvent A (0.1% formic acid in water) and solvent B (MeOH) as follows: isocratic 95% A (0-10 min), linear gradient 95-80-30% A (10-20-30 min), and isocratic 30% A (30-40 min). The flow rate was 1.0 mL/min, and aliquots of 10 μL were injected using an autosampler.
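The gradient program can be written out as a piecewise-linear function of time. This sketch assumes that '95-80-30% A (10-20-30 min)' denotes two consecutive linear segments (95% to 80% A from 10 to 20 min, then 80% to 30% A from 20 to 30 min); the breakpoint table and function name are introduced here for illustration.

```python
# (time in min, % solvent A) breakpoints, read from the gradient description.
GRADIENT = [(0, 95), (10, 95), (20, 80), (30, 30), (40, 30)]


def percent_a(t):
    """Return the percentage of solvent A at time t (min) by linear interpolation."""
    if not GRADIENT[0][0] <= t <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-40 min program")
    for (t0, a0), (t1, a1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t <= t1:
            return a0 + (a1 - a0) * (t - t0) / (t1 - t0)


for t in (5, 15, 25, 35):
    a = percent_a(t)
    # Solvent B (MeOH) makes up the remainder of the mobile phase.
    print(t, a, 100 - a)
```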